ABSTRACT

Exoplanet searches using radial velocity (RV) and microlensing (ML) produce 
samples of "projected" mass and orbital radius, respectively. We present a new 
method for estimating the probability density distribution (density) of the un- 
projected quantity from such samples. For a sample of n data values, the method 
involves solving n simultaneous linear equations to determine the weights of delta 
functions for the raw, unsmoothed density of the unprojected quantity that cause 
the associated cumulative distribution function (CDF) of the projected quantity 
to exactly reproduce the empirical CDF of the sample at the locations of the 
n data values. We smooth the raw density using nonparametric kernel density
estimation with a normal kernel of bandwidth $\sigma$. We calibrate the dependence
of $\sigma$ on n by Monte Carlo experiments performed on samples drawn from a
theoretical density, in which the integrated square error is minimized. We scale this
calibration to the ranges of real RV samples using the Normal Reference Rule.
The resolution and amplitude accuracy of the estimated density improve with n.
For typical RV and ML samples, we expect the fractional noise at the PDF peak
to be approximately $80\,n^{-\log 2}$ percent. For illustrations, we apply the new method to
67 RV values given a similar treatment by Jorissen et al. in 2001, and to the
308 RV values listed at exoplanets.org on 20 October 2010. In addition to
analyzing observational results, our methods can be used to develop measurement
requirements — particularly on the minimum sample size n — for future programs, 
such as the microlensing survey of Earth-like exoplanets recommended by the 
Astro 2010 committee. 



Subject headings: astrobiology - methods: data analysis - methods: statistical 
- techniques: radial velocity - (stars:) planetary systems - (stars:) binaries: 
spectroscopic 
1. INTRODUCTION

Mass m and orbital radius r are two key factors for the habitability of exoplanets. This 
is because m plays an important role in the retention of an atmosphere, and r is a key 
determinant of the surface temperature. Besides those connections to the cosmic search for 
life, the true distributions of m and r are also important for theories of the formation and 
dynamical evolution of planetary systems. Therefore, we have a variety of good reasons 
to better understand the cosmic distributions of m and r. Such improvement will involve 
learning more from measurements already made, as well as anticipating results from the 
telescopes and observing programs of the future. 

We focus on two astronomical techniques that measure projected planetary quantities. 
One is radial velocity (RV), the source of many exoplanet discoveries so far. The other is 
microlensing (ML). Astro 2010 recently recommended an ML survey for Earth-mass exoplanets
on orbits wider than detectable by Kepler (Blandford et al. 2010). 

RV yields the unprojected value of r, from the observed orbital period and an estimate
of the stellar mass, but for mass it can provide only $m \sin i$, not m, where i is the usually
unknown orbital inclination angle. Correspondingly, if the mass of the lens and the stellar
distances are known or assumed, ML yields m, but for r it usually can provide only
$r \sin \beta$, not r, where $\beta$ is the usually unknown planetocentric angle between the star
and the observer (Gaudi 2011). For the foreseeable future, only an ML survey appears to offer
observational access to exoplanets like Earth in mass and orbital radius around Sun-like stars.
A high-confidence estimate of the occurrence probability of such planets around nearby stars is
critical to designing future telescopes to obtain their spectra and search for signs of life.

Even when the projection angle ($i$ or $\beta$) is unknown, we can use statistical methods to draw
inferences from samples of the projected values about the distribution of the true, unprojected
values in nature. This paper presents a new method to help draw those inferences.

One goal of this paper is to introduce a science metric for the ML survey, and to 
demonstrate its use with existing RV results. The ML metric is the resolution and accuracy 
of the estimated probability density of the orbital radii of Earth-mass exoplanets. Such a
science metric will be useful for setting measurement requirements, designing telescopes and 
instruments, planning science operations, and arriving at realistic expectations for the new 
ML survey program. 
2. STARTING POINT

Our starting point is recognizing that the projected quantity, $\varphi = m \sin i$ or $r \sin \beta$, is
the product of two independent continuous random variables, $\rho = m$ or $r$, and $y = \sin i$ or
$\sin \beta$, which have densities $\Psi(\rho)$ and $\mathcal{Q}(y)$, respectively. Assuming the directions of planetary
radius vectors and orbital poles are uniformly distributed on the sphere,

$$\mathcal{Q}(y) = \frac{y}{\sqrt{1 - y^2}} . \tag{1}$$
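
Equation (1) can be checked numerically. The sketch below is an illustration and not part of the original analysis. It assumes NumPy, draws cos i uniformly on 0 to 1 (which is equivalent to orbital poles distributed uniformly on the sphere), and compares the empirical CDF of y = sin i with the CDF obtained by integrating Eq. (1), namely 1 - sqrt(1 - y^2). The names `q_cdf`, `grid`, and `max_dev` are hypothetical.

```python
import numpy as np

def q_cdf(y):
    """CDF implied by Q(y) = y / sqrt(1 - y^2) on 0 <= y <= 1."""
    return 1.0 - np.sqrt(1.0 - np.asarray(y, dtype=float) ** 2)

rng = np.random.default_rng(0)
cos_i = rng.uniform(0.0, 1.0, size=200_000)   # isotropic poles: cos(i) is uniform
y = np.sqrt(1.0 - cos_i ** 2)                 # projection factor y = sin(i)

# Compare the empirical CDF of y with the analytic CDF on a coarse grid.
grid = np.linspace(0.05, 0.95, 19)
empirical = np.array([(y <= g).mean() for g in grid])
max_dev = float(np.max(np.abs(empirical - q_cdf(grid))))
print(f"largest CDF discrepancy: {max_dev:.4f}")
```

The discrepancy should shrink roughly as the inverse square root of the number of draws, as expected for a Monte Carlo check.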

$\rho$ and $y$ have ranges $0 < \rho < \infty$ and $0 < y < 1$. The product $\varphi = \rho y$ is also a random
variable, with the density $\Phi(\varphi)$, which we can calculate as follows. The probability density
at a point $\{\rho, y\}$ on the $\rho$-$y$ plane, within the ranges of $\rho$ and $y$, is $\Psi(\rho)\,\mathcal{Q}(y)$, and $\Phi(\varphi)$
is the integral of this product over the portion of the $\rho$-$y$ plane where $\rho y = \varphi$:

$$\begin{aligned}
\Phi(\varphi) &= \int_0^\infty \int_0^1 \Psi(\rho)\,\mathcal{Q}(y)\,\delta(\rho y - \varphi)\,dy\,d\rho \\
&= \int_\varphi^\infty \Psi(\rho)\,\mathcal{Q}\!\left(\frac{\varphi}{\rho}\right) \frac{1}{\rho}\,d\rho ,
\end{aligned} \tag{2}$$

where $\delta_z$ is the Dirac delta function for any variable $z$, with the normalization

$$\int_{-\infty}^{\infty} \delta_z(z - a)\,dz = 1 , \tag{3}$$
and where in the last line of Eq. (2) we have used the fact that $\rho$ must always be greater than
$\varphi$. We now change the variable $\varphi \rightarrow u \equiv \log \varphi$, with the density $\mathcal{P}(u)$:

$$\begin{aligned}
\mathcal{P}(u) &= \Phi(\varphi)\,\frac{d\varphi}{du} = \Phi(10^u)\,(\ln 10)\,10^u \\
&= (\ln 10)\,10^u \int_{10^u}^{\infty} \Psi(\rho)\,\mathcal{Q}\!\left(\frac{10^u}{\rho}\right) \frac{1}{\rho}\,d\rho .
\end{aligned} \tag{4}$$
We now change the variable $\rho \rightarrow x \equiv \log \rho$, with the density $\mathcal{R}(x)$. Using the relation
$\Psi(\rho)\,d\rho = \mathcal{R}(x)\,dx$, we have

$$\mathcal{P}(u) = \mathcal{C}(u)\,(\ln 10) \int_u^{\infty} \mathcal{R}(x)\,\mathcal{Q}(10^{u-x})\,10^{u-x}\,dx . \tag{5}$$
In Eq. (5) we have introduced a new factor, the completeness function $\mathcal{C}(u)$, to account for
variations in search completeness due to, for example, the variation of instrumental sensitivity
with $u$. The clearest example is declining signal-to-noise ratio with smaller $u$: smaller
$m \sin i$ in the case of RV and smaller $r \sin \beta$ for ML. In this paper, we ignore completeness
effects and assume $\mathcal{C}(u) = 1$. In the case of the density estimation studies in Section 4, this
means we have not necessarily chosen a realistic theoretical $\mathcal{R}(x)$, but realism is probably
not necessary for the immediate purposes: achieving a valid calibration of the bandwidth
(smoothing length) $\sigma$ and exploring resolution and accuracy. In the case of the analysis of
real RV data in Section 5, we need only to remember that scientific interpretations of the
distributions we infer for $\mathcal{R}(\log m)$ will demand qualification regarding the potential effects
of $\mathcal{C}(u)$, particularly for the left-hand tails of the distribution, where incompleteness due to
low signal-to-noise ratio must be important.

We recognized that Eq. (5) has a thought-provoking analogy to the case of an astronomical
image (which corresponds to $\mathcal{P}$), where the object ($\mathcal{R}$) is convolved with the telescope's
point-spread function ($\mathcal{Q}$) and the field is, say, vignetted in the camera ($\mathcal{C}$). Pursuing this
analogy, we might call the form of the integral in Eq. (2) "logarithmic convolution." It could
be said that we "see" the true distribution ($\mathcal{R}$) only after it has been logarithmically
convolved with the projection function ($\mathcal{Q}$) and modulated by the completeness function ($\mathcal{C}$).
As with image processing, we can correct "vignetting" by dividing $\mathcal{P}$ by $\mathcal{C}$, if we know it, and
then "deconvolve" the result to remove the effects of $\mathcal{Q}$, which in this case we know exactly.
As with image processing, the result can be a transformation: a new, alternative, precise
description of the sample, in the form of an estimate of the "object," the natural density $\mathcal{R}$,
with some systematic effects reduced or removed (but other effects possibly remaining).

Equation (5) is a form of Abel's integral equation, as discussed in this general context by
Chandrasekhar & Münch (1950), who used the formal solution to investigate the distribution
of true and apparent rotational velocities of stars. Later, Jorissen et al. (2001) inferred the
distribution of exoplanet masses from 67 values of $m \sin i$, using both the formal solution and
the Lucy-Richardson algorithm, which is an implementation of the expectation-maximization
algorithm that is widely used in maximum-likelihood estimation (Dempster et al. 1977). We
present a third numerical approach to solving Eq. (5).

In Section 3, we develop the new method for transforming samples of measured values 
into a raw, unsmoothed density for x. In Section 4, we discuss nonparametric density 
estimation, random deviates, and the qualitative accuracy of the estimated density. In 
Section 5, we illustrate these computations on samples of RV results: the 67 values treated by
Jorissen et al. (2001) and the 308 values of $m \sin i$ available at exoplanets.org on 20 October
2010. In Section 6, we comment on implications and future directions.
3. NEW METHODS

We assume the sample of independent and identically distributed, projected values $\{u\}$
is non-redundant and sorted in ascending order. The cardinality of the sample is $n$. The
empirical cumulative distribution function (CDF) for $u$ is

$$\hat{P}(u) \equiv \frac{1}{n} \sum_{i=1}^{n} I(u_i \le u) , \tag{6}$$

where $I$ is the indicator function

$$I(\text{logical statement}) \equiv 1 \text{ if the statement is true and } 0 \text{ otherwise.} \tag{7}$$

$\hat{P}$ estimates the true CDF of the projected quantity, which is

$$P(u) \equiv \int_{-\infty}^{u} \mathcal{P}(u')\,du' . \tag{8}$$
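We can approximate $\mathcal{R}$ by a sum of Dirac delta functions at points $\{x\}$ with weights $\{w\}$:

$$\mathcal{R}(x) \approx \mathcal{R}'(x) \equiv \sum_{j=1}^{N} w_j\,\delta(x - x_j) . \tag{9}$$

The associated approximation of $P(u)$ is

$$\begin{aligned}
P'(u) &= \ln 10 \int_{-\infty}^{u} \int_{u'}^{\infty} \sum_{j=1}^{N} w_j\,\delta(x - x_j)\,\mathcal{Q}(10^{u'-x})\,10^{u'-x}\,dx\,du' \\
&= \sum_{j=1}^{N} w_j \ln 10 \int_{-\infty}^{u} \frac{10^{2(u'-x_j)}}{\sqrt{1 - 10^{2(u'-x_j)}}}\,H(-u' + x_j)\,du' \\
&= \sum_{j=1}^{N} w_j \left(1 - \sqrt{1 - 10^{2(u-x_j)}}\;H(-u + x_j)\right) ,
\end{aligned} \tag{10}$$

where

$$H(-a + b) \equiv 1 \text{ if } a < b \text{ and } 0 \text{ if } a > b \tag{11}$$

is the Heaviside unit-step function. When we set $N = n$, $u = u_i$, and $x_j = u_j$, the critically
determined set of linear equations

$$\sum_{j=1}^{n} w_j \left(1 - \sqrt{1 - 10^{2(u_i-u_j)}}\;H(-u_i + u_j)\right) = \frac{1}{n} \sum_{k=1}^{n} I(u_k \le u_i) \tag{12}$$

can be solved for $\{w\}$ by multiplying the vector that is the right side of Eq. (12) by the
inverse of the matrix that is the outer parentheses on the left side.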
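
The linear system of Eq. (12) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: it assumes NumPy, a sample with distinct values, and completeness C(u) = 1. It also uses `numpy.linalg.solve` in place of the explicit matrix inverse described in the text, which gives the same weights. The function names `projection_matrix` and `solve_weights` are hypothetical. On the diagonal the square root vanishes, so the value assigned to the step function at equality does not matter.

```python
import numpy as np

def projection_matrix(u):
    """Matrix of Eq. (12): A[i, j] = 1 - sqrt(1 - 10^(2(u_i - u_j))) * H(-u_i + u_j)."""
    u = np.asarray(u, dtype=float)
    d = u[:, None] - u[None, :]            # d[i, j] = u_i - u_j
    a = np.ones_like(d)                    # u_i > u_j: step function is 0, entry is 1
    below = d <= 0.0                       # u_i <= u_j: step function is 1
    a[below] = 1.0 - np.sqrt(1.0 - 10.0 ** (2.0 * d[below]))
    return a

def solve_weights(u):
    """Return the sorted sample points and the delta-function weights {w}."""
    u = np.sort(np.asarray(u, dtype=float))
    n = u.size
    rhs = np.arange(1, n + 1) / n          # empirical CDF at sorted, distinct points
    w = np.linalg.solve(projection_matrix(u), rhs)
    return u, w

# Tiny made-up sample of log projected values, for illustration only.
points, weights = solve_weights([1.0, 0.0, 1.5, 0.5])
residual = projection_matrix(points) @ weights - np.arange(1, 5) / 4
print(weights, weights.sum())
```

Because the last row of the matrix is all ones, the weights always sum to the final value of the empirical CDF, which is 1.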
The resulting estimates of the raw, unsmoothed density and CDF of $x$ are $\mathcal{R}'(x)$ in
Eq. (9) and

$$R'(x) = \sum_{j=1}^{n} w_j\,H(x - x_j) , \tag{13}$$

respectively. $\hat{P}(u)$ and $P'(u)$ produce identical results when evaluated at the sample points
$\{u\}$. We recognize that they are not estimates but pure transformations of the sample,
through the intermediary of the weights $\{w\}$. The same is true of $\mathcal{R}'$ and $R'$, which are sums
of discontinuous functions. Indeed, the sum of delta functions in Eq. (9) is not particularly
useful in itself because it conveys nothing more than the transformed sample. We must
smooth $\mathcal{R}'(x)$ in order to calculate non-zero results for the density at values of the unprojected
quantity other than the sample points. We discuss this smoothing in the next section.



4. NONPARAMETRIC DENSITY ESTIMATION 

Here we explore the issues associated with smoothing the raw density, $\mathcal{R}'(x)$. Without
smoothing, the form of $\mathcal{R}'(x)$ in Eq. (9) is not very interesting or useful, because the only
values it takes on are plus and minus infinity and zero. In order to pursue practical research
with $\mathcal{R}(x)$, such as comparing observations with theories, learning about possible variations
of $\mathcal{R}(x)$ with other planetary or stellar parameters, and informing the measurement
requirements for future missions and observing programs, we need $\mathcal{R}(x)$ in the form of a
sufficiently smoothed positive function. At the same time, we want to avoid over-smoothing,
which might discard real detail.

What astronomers call "smoothing" is called "nonparametric density estimation" in statistics
and other fields (see Silverman 1986, Takezawa 2006, and Wasserman 2006). Our approach,
of convolving the raw density with a Gaussian of standard deviation $\sigma$,

$$\mathcal{R}_\sigma(x) = \sum_{j=1}^{n} w_j\,\frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x - x_j)^2}{2\sigma^2}} , \tag{14}$$

is called "kernel density estimation with a normal kernel of bandwidth $\sigma$." The standard
statistical treatment flows from the study of histograms of samples of independent, identically
distributed random variables, in the limit as (a) the bin width tends to zero, (b) the count in
any bin becomes zero or one, (c) the raw density is the sum of delta functions of equal, positive
weight (1/n) located at the sample points, and (d) the kernel density estimate is the sum of
identical kernel functions located at the sample points. Ours is a different case, for which a
theory must still be developed. Our raw density [Eq. (9)] is the sum of unequally weighted
delta functions, and the weights include both positive and negative values. These differences
from the standard case occur because we are estimating density in the unprojected space,
where the sample is represented by the unequal, positive and negative weights, rather than
in the projected space of the sample itself.
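
The weighted kernel sum of Eq. (14) can be sketched as follows. This is an illustration only: it assumes NumPy, and the points and weights are made-up values, including a negative weight, chosen to mimic the unequal, signed weights described above. The name `smoothed_density` is hypothetical.

```python
import numpy as np

def smoothed_density(x, points, weights, sigma):
    """Eq. (14): sum of normal kernels of bandwidth sigma, one per weighted point."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    z = (x[:, None] - np.asarray(points, dtype=float)[None, :]) / sigma
    kernels = np.exp(-0.5 * z ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    return kernels @ np.asarray(weights, dtype=float)

# Made-up points and signed weights that sum to 1.
demo_points = [0.5, 2.0]
demo_weights = [1.4, -0.4]
x_grid = np.linspace(-3.0, 6.0, 9001)
density = smoothed_density(x_grid, demo_points, demo_weights, sigma=0.2)
area = float(density.sum() * (x_grid[1] - x_grid[0]))
print(f"area under smoothed density: {area:.6f}")
```

Each kernel integrates to 1, so the area under the smoothed density equals the sum of the weights, even though the density itself can dip below zero where negative weights dominate.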

A considerable literature has been developed on the tasks of selecting the kernel and 
bandwidth for the standard case. Optimizing the bandwidth usually calls for defining an 
objective function (figure of merit), which is maximized or minimized. Because the true 
density is not known, by definition, the objective function must be computed from the 
sample and bandwidth alone. It is not immediately clear how the extensive work on this 
problem in the standard case applies to the current case. Therefore, we take an empirical 
approach to bandwidth selection for a normal kernel by conducting Monte Carlo experiments 
with a theoretical true density. 

We expect the optimal value of $\sigma$ to be well-defined, because the asymptotes of Eq. (14)
are trivial: reversion to Eq. (9) for $\sigma \rightarrow 0$ and approaching zero everywhere as $\sigma \rightarrow \infty$.
Therefore, the optimal value of $\sigma$ must lie in between. In addition, we expect the optimal
value to decrease with higher cardinality of the sample, n, because the increased information
from more observations should include more information about detail. At least for $\sigma \ll \Delta u$,
where $\Delta u$ is the metric for the range of $u$, we expect by simple scaling that the optimal value
of $\sigma$ for different problems is proportional to $\Delta u$. Following Wasserman (2006; p. 135), we
measure the range using the range metric in the Normal Reference Rule: $\Delta u = \min(s, q/1.34)$,
where $s$ is the sample standard deviation and $q$ is the interquartile range.
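
The range metric is simple to compute. The sketch below is an illustration that assumes NumPy; the function name `range_metric` and the synthetic normal sample are hypothetical. For a normal sample, the interquartile range divided by 1.34 is close to the standard deviation, so both arguments of the minimum are nearly equal.

```python
import numpy as np

def range_metric(u):
    """Normal Reference Rule range metric: min(s, q / 1.34)."""
    u = np.asarray(u, dtype=float)
    s = u.std(ddof=1)                          # sample standard deviation
    q75, q25 = np.percentile(u, [75.0, 25.0])  # quartiles
    return float(min(s, (q75 - q25) / 1.34))

rng = np.random.default_rng(3)
sample = rng.normal(loc=0.0, scale=2.0, size=100_000)  # synthetic data, sd = 2
delta_u = range_metric(sample)
print(f"range metric: {delta_u:.3f}")
```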

As a first exploration, we study the case of a theoretical density $\mathcal{R}(x)$ comprising three
Gaussians, with {mean, standard deviation, weight} = {0.5, 0.25, 2.0}, {2.0, 0.5, 8.0}, and
{2.5, 0.125, 1.0}. In this case, $\Delta u = 0.783$. For this exercise we developed a facility to
perform the following sequence of steps: (1) create random samples $\{u\}$ of cardinality $n$,
where each value is drawn from the random deviate for $u$; and for such samples, (2) solve
Eqs. (12) for weights $\{w\}$; (3) construct the estimated density $\mathcal{R}_\sigma(x)$ for any value of $\sigma$ using
Eq. (14); (4) compute the objective function (integrated square error) $\nu$ for the closeness of
$\mathcal{R}_\sigma(x)$ to $\mathcal{R}(x)$, where

$$\nu \equiv \int_{-\infty}^{\infty} \left(\mathcal{R}_\sigma(x) - \mathcal{R}(x)\right)^2 dx ; \tag{15}$$

and (5) determine the value of $\sigma$ that minimizes $\nu$, which we adopt as the working definition
of the optimal value of $\sigma$.
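
The theoretical density and the objective function of step (4) can be sketched as follows. This is an illustration under assumptions, not the authors' facility: it assumes NumPy, treats the three listed weights as relative weights that are normalized to sum to 1, and approximates the integral by a sum on a fine grid. The names `three_gaussian` and `integrated_square_error` are hypothetical, and the shifted curve merely stands in for an estimated density.

```python
import numpy as np

# (mean, standard deviation, relative weight) for the three components in the text.
COMPONENTS = [(0.5, 0.25, 2.0), (2.0, 0.5, 8.0), (2.5, 0.125, 1.0)]

def three_gaussian(x):
    """Three-Gaussian density, with the relative weights normalized (an assumption)."""
    x = np.asarray(x, dtype=float)
    total = sum(w for _, _, w in COMPONENTS)
    dens = np.zeros_like(x)
    for mu, sd, w in COMPONENTS:
        dens += (w / total) * np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))
    return dens

def integrated_square_error(estimate, truth, dx):
    """Grid approximation to the integral of (estimate - truth)^2."""
    return float(np.sum((estimate - truth) ** 2) * dx)

grid = np.linspace(-2.0, 5.0, 7001)
dx = grid[1] - grid[0]
truth = three_gaussian(grid)
shifted = three_gaussian(grid - 0.1)      # stand-in for an imperfect estimate
nu_same = integrated_square_error(truth, truth, dx)
nu_shift = integrated_square_error(shifted, truth, dx)
print(nu_same, nu_shift)
```

In a full calibration, the stand-in estimate would be replaced by the smoothed density for a trial bandwidth, and the bandwidth would be varied to minimize the error.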

The random deviate for $u$ is

$$U = \log(10^X\,Y) , \tag{16}$$

where

$$X = x\text{-root}\left(\int_{-\infty}^{x} \mathcal{R}(x')\,dx' = Z\right) , \tag{17}$$

where $\mathcal{R}(x')$ stands for the true or estimated density under study, where

$$Y = \sqrt{2Z - Z^2} , \tag{18}$$

where $Z$ is the uniform random deviate on the range 0-1 (drawn independently for Eqs. (17)
and (18)), and where "x-root" in Eq. (17) is defined as the value of $x$ that satisfies the
equation in parentheses.
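
The random deviate can be sketched as below. This is an illustration that assumes NumPy and solves the x-root numerically by inverting a tabulated CDF with linear interpolation, which is one of several possible root-finding choices. The function name `draw_projected` and the single-Gaussian density are hypothetical. The two uniform deviates are drawn independently, and the second is taken on (0, 1] so that the logarithm stays finite.

```python
import numpy as np

def draw_projected(n, grid, density, rng):
    """Return (x, u): draws of the unprojected and projected logarithmic quantities."""
    cdf = np.cumsum(density)
    cdf = cdf / cdf[-1]                              # tabulated CDF on the grid
    x = np.interp(rng.uniform(size=n), cdf, grid)    # numerical x-root, Eq. (17)
    z = 1.0 - rng.uniform(size=n)                    # uniform on (0, 1]
    y = np.sqrt(2.0 * z - z ** 2)                    # projection factor, Eq. (18)
    u = x + np.log10(y)                              # same as log(10^x * y), Eq. (16)
    return x, u

rng = np.random.default_rng(42)
grid = np.linspace(-2.0, 5.0, 2001)
density = np.exp(-0.5 * ((grid - 2.0) / 0.5) ** 2)   # hypothetical density for x
x, u = draw_projected(100_000, grid, density, rng)
mean_shift = float(np.mean(u - x))
print(f"mean of log y: {mean_shift:.4f}")
```

Because y never exceeds 1, every projected value u is less than or equal to its unprojected value x. The mean shift should be close to (ln 2 - 1)/ln 10, the expectation of log sin i for isotropic orientations.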

The results of experiments involving these five steps are shown in Figs. 1-6. Figure 1
confirms the expectation that lower values of $\sigma$ retain the spiky original pattern of delta
functions in Eq. (9), and that higher values of $\sigma$ reduce contrast by blurring features. As
suggested by the progression of the colored curves in Figure 1, varying $\sigma$ to minimize $\nu$ will
work efficiently to locate an optimal value, which for n = 100 should be somewhere in the
range $0.1 < \sigma < 0.4$.

Figure 2 shows the results of minimizing $\nu$ to optimize $\sigma$. For each of 16 values of $\log n$
in the range 1-4, we prepared 24 random samples $\{u\}$ of cardinality $n$, drawing from the
same three-Gaussian distribution. Next, we adjusted $\sigma$ to minimize $\nu$ in order to obtain the
optimal value of $\sigma$ for each sample. Next, we determined the mean and standard deviation
of the 24 values of optimized $\sigma$ at each value of $n$. Finally, we computed the best quadratic
fit to these data points, which is

$$\sigma(n) = \left(0.56 - 0.21 \log n + 0.023\,(\log n)^2\right) \frac{\Delta u}{0.783} . \tag{19}$$

We use Eq. (19) to compute the value of $\sigma$ in the remainder of this paper.
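
The bandwidth calibration is a one-line formula. The sketch below is an illustration using only the Python standard library; the function name `bandwidth` is hypothetical, and the coefficients are the rounded values printed in Eq. (19).

```python
import math

def bandwidth(n, delta_u):
    """Eq. (19): calibrated bandwidth for sample size n and range metric delta_u."""
    log_n = math.log10(n)
    return (0.56 - 0.21 * log_n + 0.023 * log_n ** 2) * delta_u / 0.783

sigma_100 = bandwidth(100, 0.783)   # calibration case: n = 100, three-Gaussian density
sigma_67 = bandwidth(67, 0.491)     # n and range metric quoted for the 67-point sample
sigma_308 = bandwidth(308, 0.600)   # n and range metric quoted for the 308-point sample
print(sigma_100, sigma_67, sigma_308)
```

With the rounded coefficients, this gives about 0.159 and 0.138 for the two RV samples, close to but not identical with the values 0.155 and 0.133 quoted in the figure captions. The small difference presumably reflects the rounding of the published coefficients.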

Figures 3-6 illustrate the qualitative improvement in the resolution and accuracy of
the amplitudes of $\mathcal{R}_{\sigma(n)}(x)$ as $n$ increases. In this controlled experiment, with $\mathcal{R}(x)$ known,
we can evaluate performance in two ways: absolutely, by directly comparing $\mathcal{R}_{\sigma(n)}(x)$ to the
known $\mathcal{R}(x)$, and relatively, by assessing the variation of independent realizations of $\mathcal{R}_{\sigma(n)}(x)$
with respect to each other. In this experiment, these alternative evaluations are shown to
be consistent: features that repeat robustly in multiple realizations are seen to correspond
to true characteristics of $\mathcal{R}(x)$. Meanwhile, features that do not repeat are spurious. Said
differently, we find no robustly repeated features in $\mathcal{R}_{\sigma(n)}(x)$ that are not present in $\mathcal{R}(x)$, nor
any evidence that $\mathcal{R}_{\sigma(n)}(x)$ with sufficiently high $n$ would fail to reproduce, to any desired
accuracy, any feature in $\mathcal{R}(x)$, no matter how narrow or small in amplitude.

In Figure 3, we find that samples with n = 10 offer little information about the true
distribution of x, with fluctuations on the scale of ~40% amplitude near the peak. In 
Fig. 1. For a theoretical three-Gaussian density $\mathcal{R}(x)$ (black curve), the effects of
convolving the raw density $\mathcal{R}'(x)$ with normal kernels of various bandwidths $\sigma$. Four random
samples $\{u\}$ with n = 100 were prepared using Eq. (16). The four sets of weights $\{w\}$ were
determined by solving Eqs. (12). The four functions $\mathcal{R}'(x)$ resulting from Eq. (9) produced
four functions $\mathcal{R}_\sigma(x)$ from Eq. (14) with $\sigma$ = 0.05, 0.10, 0.2, and 0.4, which are plotted here
in color. The red and orange curves are under-smoothed and still dominated by vestiges of
the delta functions. The blue curve is over-smoothed: the dip between the two main peaks
is blurred away, and the tails extend well beyond the range of $\mathcal{R}(x)$. The green curve is the
most satisfactory of the four, accurately locating the two main peaks in $\mathcal{R}(x)$ and revealing
the distinct minimum between them.
Fig. 2. Bandwidth calibration, $\sigma(n)$. Points: mean values of $\sigma$ that minimize the integrated
square error, determined from 24 samples $\{u\}$ for each value of $n$. Error bars: the standard
deviations of the 24 values contributing to each mean. Curve: the weighted quadratic fit
to the data points. We use this calibration, scaled by the range metric, to select $\sigma$ for all
smoothing in the remainder of this paper.
Fig. 3. The variation of $\mathcal{R}_\sigma(x)$ for n = 10, in the case of the same theoretical
three-Gaussian distribution $\mathcal{R}(x)$ (red curve). Gray and blue curves: 25 independent, random
realizations of $\mathcal{R}_{\sigma(n)}(x)$. There are no robustly repeated features in the recovered density
that are present in the true density, except the extent and perhaps the skewness of $\mathcal{R}(x)$.
This result suggests that RV or ML samples with n = 10 over a comparable range should
offer little information about the true distribution of mass or orbital radius. The fractional
amplitude noise at the peak is about ±40%.
Figure 4, we find that samples with n = 100 offer crude information about the shape of
the true distribution, with fluctuations on the scale of ~20% amplitude near the peak.
In Figure 5, we find that samples with n = 1000 reveal the basic structure of the true
distribution, with fluctuations on the scale of ~10% amplitude near the peak. In Figure 6,
resolution has improved, noise has been reduced, and some substructure is revealed, with
fluctuations on the scale of ~5% amplitude near the peak.

We can summarize these experiments with the finding that the fractional noise fluctuations
at the peak, expressed as a percentage of the peak height, are approximately $80\,n^{-\log 2}$,
which halves for each tenfold increase in $n$.
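
The noise scaling can be tabulated and inverted to suggest a sample size for a target noise level. The sketch below is an illustration using only the Python standard library. The names `peak_noise_percent` and `sample_size_for_noise` are hypothetical, and the inversion is an extrapolation of the empirical fit, which was derived from the three-Gaussian experiment and may not carry over unchanged to other densities.

```python
import math

LOG2 = math.log10(2.0)   # exponent of the empirical scaling, about 0.301

def peak_noise_percent(n):
    """Approximate fractional noise at the density peak, in percent."""
    return 80.0 * n ** (-LOG2)

def sample_size_for_noise(percent):
    """Sample size at which the fitted noise level equals `percent` (extrapolation)."""
    return (80.0 / percent) ** (1.0 / LOG2)

for n in (10, 100, 1000, 10000):
    print(n, round(peak_noise_percent(n), 2))
print(round(sample_size_for_noise(10.0)))
```

The tabulated values reproduce the ~40%, ~20%, ~10%, and ~5% levels read from Figs. 3-6.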

In the following section, we study real data from RV programs, where the true function
$\mathcal{R}(x)$ is unknown, and indeed is the main object of the research. We require a method to
determine which features are believable in the density $\mathcal{R}_{\sigma(n)}(x)$ estimated from such data. For
this we need the functional equivalent of a density of densities, to describe the distribution of
$\mathcal{R}_{\sigma(n)}(x)$, which is the distribution we might estimate from multiple, statistically equivalent
data sets drawn from nature, but which we do not have, of course. The way forward is
to assume that the unobtainable distribution is approximately the same as the distribution
of functions $\mathcal{R}_{\sigma(n)}(x)$ that we can compute from multiple samples drawn from the random
deviate for the density $\mathcal{R}_{\sigma(n)}(x)$ derived from the real data.

Basically, this approach uses self-consistency as a measure of the fidelity of the inferred
density. It assumes that $\mathcal{R}_{\sigma(n)}(x)$ is the true density and asks whether new, random data sets
drawn from $\mathcal{R}_{\sigma(n)}(x)$, with the same cardinality, robustly repeat the features in $\mathcal{R}_{\sigma(n)}(x)$. If
so, they are probably real. If not, they are probably spurious.
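
The self-consistency test can be sketched end to end. This is a minimal illustration under several assumptions, not the authors' implementation: it assumes NumPy and C(u) = 1, clips any negative lobes of the inferred density to zero before resampling, inverts the tabulated CDF by interpolation, and summarizes the replicate densities by their pointwise minimum and maximum. All function names and the synthetic sample are hypothetical.

```python
import numpy as np

def solve_weights(u):
    """Weights of Eq. (12) for a sample with distinct values."""
    u = np.sort(np.asarray(u, dtype=float))
    d = u[:, None] - u[None, :]
    a = np.ones_like(d)
    mask = d <= 0.0
    a[mask] = 1.0 - np.sqrt(1.0 - 10.0 ** (2.0 * d[mask]))
    return u, np.linalg.solve(a, np.arange(1, u.size + 1) / u.size)

def smooth(grid, points, weights, sigma):
    """Smoothed density of Eq. (14) evaluated on a grid."""
    z = (grid[:, None] - points[None, :]) / sigma
    return (np.exp(-0.5 * z ** 2) / (sigma * np.sqrt(2.0 * np.pi))) @ weights

def draw_u(n, grid, density, rng):
    """Draw n projected values from a density on the grid (Eqs. 16-18)."""
    pdf = np.clip(density, 0.0, None)        # assumption: clip negative lobes
    cdf = np.cumsum(pdf)
    cdf = cdf / cdf[-1]
    x = np.interp(rng.uniform(size=n), cdf, grid)
    z = 1.0 - rng.uniform(size=n)            # uniform on (0, 1]
    return x + np.log10(np.sqrt(2.0 * z - z ** 2))

def confidence_band(u, sigma, grid, trials, rng):
    """Inferred density plus the envelope of densities from resampled data sets."""
    points, weights = solve_weights(u)
    best = smooth(grid, points, weights, sigma)
    replicates = []
    for _ in range(trials):
        p, w = solve_weights(draw_u(len(u), grid, best, rng))
        replicates.append(smooth(grid, p, w, sigma))
    replicates = np.array(replicates)
    return best, replicates.min(axis=0), replicates.max(axis=0)

rng = np.random.default_rng(1)
grid = np.linspace(-2.0, 4.0, 301)
u_obs = rng.normal(1.0, 0.5, size=40)        # synthetic stand-in for real data
best, lower, upper = confidence_band(u_obs, 0.3, grid, 10, rng)
area = float(best.sum() * (grid[1] - grid[0]))
print(f"area under inferred density: {area:.4f}")
```

Features of the inferred density that stay within a narrow envelope across the replicates would be judged believable in the sense described above, while features comparable in size to the envelope width would not.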

This approach to "confidence" in $\mathcal{R}_{\sigma(n)}(x)$ is a natural extension of the Monte Carlo
methods for determining confidence regions described in Section 14.5 of Press et al. (1986).
We take this approach in the next section, where we study two real samples of RV data.

5. RV DATA 

To illustrate the method developed in Sections 3-4, we treat two samples of RV data.
The first is the 67 values of $m \sin i$ treated by Jorissen et al. (2001), which were kindly
provided to us by A. Jorissen. This sample is of particular interest because the authors
used it to estimate the density for $m$ using Eq. (5) (which is identical to their Eq. 4 if
completeness is ignored [$\mathcal{C}(u) = 1$] and the logarithmic variables are changed back to $m$ and
$m \sin i$). Therefore, we can compare results.

Figure 7 shows our result for $\mathcal{R}_{\sigma(n)}(\log m)$, which is nearly featureless, unlike the Jorissen



[Plot residue for the figure captioned below; panel label: Bandwidth: $\sigma = 0.226$]
Fig. 4. The variation of $\mathcal{R}_\sigma(x)$ for n = 100. Amplitude variation is reduced and resolution
is increased. The bimodal character of $\mathcal{R}(x)$ is revealed. We see that RV or ML samples
with n = 100 should contain crude information about the shape of the true distribution of
mass or orbital radius. The fractional amplitude noise at the peak is about ±20%.
Fig. 5. The variation of $\mathcal{R}_\sigma(x)$ for n = 1000. The locations and amplitudes of the two
main peaks are accurately recovered. Improved resolution reveals the greater width of the
right-hand peak compared with the left-hand peak, but the dip at the top is not recovered.
Amplitude variation is further reduced. We find that RV or ML samples with n = 1000
should convey information about the basic structure of the true distribution of mass or
orbital radius, but would not reveal secondary features with less than about 10% amplitude,
which is approximately the fractional noise amplitude at the peak.
Fig. 6. The variation of $\mathcal{R}_\sigma(x)$ for n = 10,000. The amplitude accuracy and resolution of
individual realizations are further improved, and the small dip at the top of the right peak
is marginally resolved. The fractional amplitude of noise at the peak is about ±5%.
et al. results, shown in their Fig. 2. Because they do not use logarithmic variables, their
results are simplified for $m < 1\,m_J$, even though 31% of the sample values lie in that range.
Nevertheless, we can recover the main features in the Jorissen et al. PDF for $m > 1\,m_J$ if
we grossly under-smooth.

The feature Jorissen et al. found most believable is the minimum between the second
and third peaks (green arrows), in the range 10-13 $m_J$. They applied a "jackknife" test and
concluded that this minimum was a "robust result, not affected by the uncertainty in the
solution." Based on our experiments in Section 4, and on our treatment of the Jorissen data
(red and gray in Fig. 7), we conclude that all the features in our under-smoothed
$\mathcal{R}_\sigma(\log m)$, including the minimum favored by Jorissen et al., which we can reproduce, are
spurious artifacts of under-smoothing.

The dip in the peak of the red curve in Figure [7] is not believable, given the significantly 
greater width of the confidence region (gray). 

The second sample we studied is the 308 values of $m \sin i$ listed at exoplanets.org on
20 October 2010. This sample is documented in Table 1, and Figure 8 shows the results
of our treatment. The feature favored by Jorissen et al. at 10-13 $m_J$ is also not present in
this density, which is improved in both resolution and amplitude accuracy compared with
n = 67.

For the n = 308 density, the increase as $m$ decreases from ~50 to ~10 $m_\oplus$ (0.16-0.03 $m_J$)
repeats robustly in Monte Carlo trials and is statistically significant. Interestingly,
this minimum is near the low-$m$ cutoff of the Jorissen data, and therefore no information
about the bimodal character of $\mathcal{R}_{\sigma(n)}(x)$ is present in the earlier sample of n = 67 RV data
points.

6. COMMENTS AND FUTURE DIRECTIONS 

This paper offers an improved understanding of statistical inferences regarding the density
of the unprojected quantities (mass $m$ for RV or radial distance $r$ for ML) from samples
of the projected quantities, $m \sin i$ or $r \sin \beta$. The same treatment applies to both ML and
RV.

We find that the ability to confidently recognize real features in the estimated density,
and to reject spurious ones, depends on (1) the scale of the variations compared with the
blurring scale of $\mathcal{Q}$ (standard deviation of 0.18 in the logarithm), (2) $\Delta u$, the metric of the
range of the measurements, (3) the amplitudes of the features, and (4) n, the cardinality
Fig. 7. Treatment of the 67 RV data points of Jorissen et al. (2001). The value $\sigma = 0.155$
comes from Eq. (19) using $\Delta u = 0.491$, the range metric of the data. In red, the inferred
density, $\mathcal{R}_{\sigma(n)}(\log m)$. In gray, the confidence region delineated by densities generated from
100 statistically equivalent data sets drawn from the random deviate for $\mathcal{R}_{\sigma(n)}(\log m)$. The
dip in the peak of the red curve is not believable, given the significantly greater width of the
confidence region (gray). In green, an under-smoothed version of $\mathcal{R}_\sigma(\log m)$, with $\sigma = 0.05$,
which resembles the density inferred by Jorissen et al. (2001), for $m > 1\,m_J$, shown in their
Fig. 2, particularly the peaks at ~4.0, 7.5, and 15 $m_J$, indicated here by green arrows. These
features are apparently artifacts of under-smoothing.
Fig. 8. Treatment of the sample of 308 values of $m \sin i$ from RV observations listed
at exoplanets.org on 20 October 2010. The sample is documented in Table 1. The value
$\sigma = 0.133$ comes from Eq. (19) using $\Delta u = 0.600$, the approximate range of the data. In
red, the inferred density, $\mathcal{R}_{\sigma(n)}(\log m)$. In gray, the confidence region delineated by densities
generated from 100 statistically equivalent data sets, which were drawn from the random
deviate for $\mathcal{R}_{\sigma(n)}(\log m)$. The rise of the density with decreasing $m$ from ~50 to ~10 $m_\oplus$
(0.16-0.03 $m_J$) repeats robustly and is statistically significant.
of the sample. The latter is the only factor controlled by the observer or the design of a
mission.

Care must be given to the completeness factor, $\mathcal{C}(u)$, as well as to other possible
systematic effects and biases that might affect the fidelity with which a sample of projected
values reflects the true distribution of the unprojected quantity. For example, in the case
of RV, we know that the detection efficiency, and therefore the completeness, decreases to
zero as the projected mass, and therefore the signal-to-noise ratio ($u$ divided by the noise
amplitude), decreases. Also, the least-squares estimator for mass from astrometry is known
to be biased at low signal-to-noise ratio, and the same can be expected for $m \sin i$ from RV,
as the treatment of all Keplerian signals is basically the same for planet detection by
periodogram (Brown 2009; see also Hogg et al. 2010). Any such factors affecting ML samples
also bear close scrutiny. Monte Carlo experiments with the random deviates and analytic
method developed in this paper should be helpful for studying the systematic effects of $\mathcal{C}(u)$
on the density.

Astro 2010 recommended an unbiased census of Earth-like exoplanets by WFIRST using
the ML technique. This recommendation implies specific, but not yet defined, science
requirements on the accuracy of the $\mathcal{R}(\log r)$ inferred from ML events by exoplanets with
typical characteristics $m \sim 1\,m_\oplus$ and $r \sim 1$ AU. In turn, these science requirements will
imply measurement requirements, particularly on n, which controls random errors, and on
knowledge of $\mathcal{C}(u)$, which controls systematic errors. In turn, the measurement requirements
will flow down to the mission design, including both spacecraft and ground operations.
We expect that Monte Carlo studies based on the method in this paper will be helpful in
achieving adequacy and self-consistency for the ML components of the WFIRST project,
from spacecraft to ground operations.

We note that, in the usual case where the angular factor $y$ is not known independently,
Ho and Turner (2011) have recently pointed out that one must assume a density for the
unprojected quantity $x$ in order to properly state the confidence interval for the value of
$x$ derived from any particular observation of the projected quantity $u$. This is because
the posterior distribution of $y$ is not the same as the prior distribution of $y$. The needed
density could come from (1) a theoretical guesstimate (entailing systematic uncertainties),
(2) a sample of $x$ from planets with known $y$ (transiting planets in the case of RV), or (3) a
sample of $u$ from planets with unknown $y$, using the method developed in this paper.

The introduction of the logarithmic variables $u$, $x$, and $\log y$ in Section 2,

$$u = x + \log y , \tag{20}$$

changed the problem from one of a product of random variables to one of a sum of random
variables. This change creates a connection to recent statistical research on the
"errors-in-variables" problem (Staudenmayer et al. 2008; Apanasovich et al. 2009; Delaigle et al.
2009). The goal in this problem is to estimate the density of $x$, which is not observable,
from observations of $u$, which is a version of $x$ contaminated by the additive, homoscedastic
measurement error, $\log y$, with known density. The cited papers explore non-parametric
estimation of the density of $x$ variously using B splines, simulation extrapolation, and
local polynomials. We expect that future research into density estimation for $m$ or $r$ from RV and
ML samples of $m \sin i$ or $r \sin \beta$ will explore such advances in statistical research, and produce
instructive comparisons with our approach, kernel density estimation with a normal kernel.
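
The additive "error" term log y in Eq. (20) has a fixed, known distribution, which can be summarized numerically. The sketch below is an illustration that assumes NumPy and isotropic orientations; the variable names are hypothetical. The analytic mean used for comparison, (ln 2 - 1)/ln 10, follows from integrating log sin i over a uniform distribution of cos i.

```python
import numpy as np

rng = np.random.default_rng(7)
z = 1.0 - rng.uniform(size=400_000)              # uniform on (0, 1]
log_y = np.log10(np.sqrt(2.0 * z - z ** 2))      # log of the projection factor, Eq. (18)

mean_log_y = float(log_y.mean())
std_log_y = float(log_y.std())
expected_mean = (np.log(2.0) - 1.0) / np.log(10.0)
print(f"mean = {mean_log_y:.4f} (analytic {expected_mean:.4f}), std = {std_log_y:.4f}")
```

The standard deviation should come out near the blurring scale of 0.18 in the logarithm quoted earlier in this section, and the negative mean reflects the fact that projection can only reduce the observed value.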

We thank D. Latham, D. Spiegel, W. Traub, E. Turner, and an anonymous referee for
their helpful comments. We thank A. Jorissen for providing the sample of 67 RV
measurements of $m \sin i$ used in Jorissen et al. (2001). We thank Sharon Toolan for her expert
preparation of the manuscript.
Table 1. The 308 Values of $m \sin i$ from RV at exoplanets.org on 20 October 2010